ABSTRACT

The features based on the MEL cepstrum have long dominated
probabilistic methods in automatic speech recognition (ASR).
This feature set has evolved to maximize general ASR
performance within a Bayesian classifier framework using a
common feature space. Now, however, with the advent of the
PDF projection theorem (PPT) and the class-specific method
(CSM), it is possible to design features separately for each
phoneme and compare log-likelihood values fairly across
various feature sets. In this paper, class-dependent features
are found by optimizing a set of frequency-band functions for
projection of the spectral vectors, analogous to the MEL
frequency band functions, individually for each class. Using
this method, we show significant improvement over standard
MEL cepstrum methods in speaker- and phoneme-specific
recognition.

1.  INTRODUCTION 

The MEL cepstrum features [1] and their derivatives have long
been the staple of automatic speech recognition (ASR) systems.
One may write the MEL cepstrum features as

z = \mathrm{DCT}(\log(A'y)),  (1)

where vector y is the length-(N/2 + 1) spectral vector (the
magnitude-squared DFT output) and the columns of A are
the MEL band functions [1]. The logarithm and the discrete
cosine transform (DCT) are invertible functions. There is
no dimension reduction or information loss, so they may be
considered a feature conditioning step which results in more
Gaussian-like and independent features. Thus, we may
concentrate our attention on the matrix multiplication

w  =  A'y.  (2) 
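Equations (1) and (2) can be illustrated with a minimal NumPy sketch. The triangular band shape, the mel-scale formula, and the helper names (mel_band_matrix, dct_matrix, mel_cepstrum) are assumptions made for illustration only; the paper specifies just that the columns of A are MEL band functions. The settings N = 96, a 12 kHz sample rate, and 10 bands plus two half-bands follow the values used later in the experiments.

```python
import numpy as np

def mel_band_matrix(n_fft, fs, n_bands):
    """Columns are triangular bands with mel-spaced centers (an assumed design)."""
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_bands interior centers plus the zero and Nyquist half-bands
    centers = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_bands + 2))
    eye = np.eye(n_bands + 2)
    cols = [np.interp(freqs, centers, eye[j]) for j in range(n_bands + 2)]
    return np.stack(cols, axis=1)

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so the transform is invertible."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * k * (m + 0.5) / n)
    d[0] /= np.sqrt(2.0)
    return d

def mel_cepstrum(y, A):
    w = A.T @ y                              # equation (2): dimension reduction
    return dct_matrix(len(w)) @ np.log(w)    # equation (1): conditioning

N, FS, NC = 96, 12000, 10
A = mel_band_matrix(N, FS, NC)
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                   # stand-in for one frame of raw data
y = np.abs(np.fft.rfft(x)) ** 2              # length N/2 + 1 spectral vector
z = mel_cepstrum(y, A)
```

Because the log and the orthonormal DCT can both be undone, w is recoverable from z; the only information loss is in the projection A'y.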

The key operation here is dimension reduction by linear
projection onto a lower-dimensional space. Now, with the
introduction of the class-specific method (CSM) and the PDF
projection theorem (PPT) [2], one is free to explore
class-dependent features within the rigid framework of Bayesian
classification. Some work has been done in class-dependent
features [3], [4]; however, existing approaches are only able to
use different features through the use of compensation factors
to make likelihood comparisons fair. Such approaches work
if the class-dependent feature transformations are restricted to
certain limited sets. Both methods fall short of the potential
of the PPT, which makes no restriction on the type of feature
transformations available to each phoneme. Under CSM, the
“common feature space” is the time-series (raw data) itself.
Feature PDFs, evaluated on different feature spaces, are
projected back to the raw data space where the likelihood
comparison is done. Besides its generality, the CSM paradigm
has many additional advantages. For example, there
is a quantitative class-dependent measure to optimize that
allows the design of the class-dependent features in isolation,
without regard to the other classes.

2.  CLASS-SPECIFIC  APPROACH 

When  applying  CSM,  one  must  find  class-dependent  signal 
processing  to  produce  features  that  characterize  each  class 
or  sub-class.  We  seek  an  automatic  means  of  optimizing  the 
matrix  A  for  a  given  subclass.  We  first  review  CSM. 

2.1  Class-Specific  Method  (CSM) 

Let there be M classes among which we would like to classify.
The class-specific classifier, based on the PPT, is given by

\arg\max_m \; p_p(x|H_m),

where p_p(x|H_m) is the projected PDF (projected from the
feature space to the raw data space). The projected PDF is
given by

p_p(x|H_m) = J_m(x, A_m, H_{0,m}) \, p(z_m|H_m),  (3)

where p(z_m|H_m) is the feature PDF estimate (estimated from
training data) and the J-function is given by

J_m(x, A_m, H_{0,m}) = \frac{p(x|H_{0,m})}{p(z_m|H_{0,m})},  (4)

and H_{0,m} are class-dependent reference hypotheses. The
class-dependent features z_m are computed from the spectral
vector y through the class-dependent subspace matrices A_m
as

z_m = C(A_m' y),  (5)

where C is the feature conditioning transformation. Note that
the J-function is a fixed function of x precisely defined by
the feature transformation from x to z and the reference
hypotheses H_{0,m}. It is the “correction term” that allows feature
PDFs from various feature spaces to be compared fairly because
the resulting likelihood function is a PDF on the
raw data space x. The J-function is a generalization of the
determinant of the Jacobian matrix in the case of a 1:1
transformation. The PPT guarantees that p_p(x|H_m) given by (3) is
a PDF, so it integrates to 1 over x regardless of the reference
hypothesis H_{0,m} or the feature transformation producing z_m
from x. It is up to the designer to choose H_{0,m} and A_m to
make p_p(x|H_m) as good an estimate of p(x|H_m) as possible.
The designer is guided by the principle that if z_m is a sufficient
statistic for H_m vs. H_{0,m}, then p_p(x|H_m) will equal p(x|H_m)
(provided p(z_m|H_m) is a good estimate). We can also think
of it as a way of embedding a low-dimensional PDF within a
high-dimensional PDF.
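The claim that the projected PDF integrates to 1 can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes two-sample raw data, an energy feature z = x_1^2 + x_2^2, a unit-variance Gaussian noise reference H_0 (under which z is chi-square with two degrees of freedom), and an arbitrary exponential density standing in for the trained feature PDF.

```python
import numpy as np

def log_p_x_h0(x):
    """log p(x|H0) for independent N(0,1) samples; x has shape (n, 2)."""
    return -0.5 * np.sum(x ** 2, axis=1) - np.log(2.0 * np.pi)

def log_p_z_h0(z):
    """log p(z|H0) for z = x1^2 + x2^2: chi-square with 2 degrees of freedom."""
    return np.log(0.5) - 0.5 * z

def log_feature_pdf(z):
    """Assumed stand-in for p(z|Hm): a unit-mean exponential density on z >= 0."""
    return -z

def log_projected_pdf(x):
    z = np.sum(x ** 2, axis=1)
    log_j = log_p_x_h0(x) - log_p_z_h0(z)    # log J-function, as in (4)
    return log_j + log_feature_pdf(z)        # log of the projected PDF (3)

# Monte Carlo estimate of the integral of p_p(x) over x, written as the
# expectation of p_p(x) / p(x|H0) with x drawn from H0.
rng = np.random.default_rng(0)
x = rng.standard_normal((200_000, 2))
integral = float(np.mean(np.exp(log_projected_pdf(x) - log_p_x_h0(x))))
```

The estimate should be close to 1 for any valid density placed on z, which is the property that makes likelihoods from different feature sets comparable.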

We have good reason, as we shall see, to use a common
reference hypothesis H_0, which simplifies the classifier to

\arg\max_m \; J_m(x, A_m, H_0) \, p(z_m|H_m),  (6)

where the J-function J_m(x) now depends only on A_m. Note
that in contrast to other class-dependent schemes using pairwise
or tree tests, CSM is a Bayesian classifier and has
the promise of providing a “drop-in” replacement for
the MEL-cepstrum based feature processors in existing ASR
systems.

2.2  Finding  a  class-specific  subspace 

We are interested in adapting the matrix A to an individual
class. We propose the strategy of selecting A_m to maximize
the total log-likelihood of the training data using the
projected PDF. Let

L(x^1, x^2, \ldots, x^K; A_m) = \sum_{i=1}^{K} \log p_p(x^i|H_m),  (7)

where K is the number of training vectors. If we expand
p_p(x|H_m),

p_p(x|H_m) = \left[ \frac{p(x|H_0)}{p(z_m|H_0)} \right] p(z_m|H_m),

where H_0 is the independent Gaussian noise hypothesis, we
see that the term p(x|H_0) is independent of A_m. Thus, to
maximize L, we need to maximize the average value of

\log p(z_m|H_m) - \log p(z_m|H_0).  (8)

Our approach is to assume that the first term in (8) is only
weakly dependent on A_m and concentrate on the second
term. Given the simplicity of the reference hypothesis H_0, the
second term p(z_m|H_0) can be known, either in analytic form
or in an accurate analytic approximation [5]. Thus, it is easy
to analyze its behavior as A_m changes. We have obtained the
first derivatives of log p(z_m|H_0) with respect to each element
of A_m. We proceed, then, by ignoring the term p(z_m|H_m) and
maximizing the function

Q(x^1, x^2, \ldots, x^K; A_m) = -\sum_{i=1}^{K} \log p(z_m^i|H_0).  (9)

The change in p(z_m|H_m) as A_m is changed can be minimized
by insisting on an orthonormal form for A_m. Thus, by
maximizing Q (9) under the restriction that A_m is orthonormal,
we approximately maximize L (7). We apply the following
constraints to A_m:


•  Orthonormality. The columns of A_m are an orthonormal
set of vectors. We use orthonormality under the
inner product

\langle x, y \rangle = \sum_{i=0}^{N/2} \epsilon_i x_i y_i,

where \epsilon_i has the value 2 except for the end bins (0 and
N/2), where it has the value 1. Orthonormality under this
inner product means that the spectral vectors will be
orthonormal if extended to the full N bins. Use of
orthonormality helps to stabilize the term p(z_m|H_m) as A_m is
varied.

•  Energy sufficiency. The energy sufficiency constraint
means that the total energy in x,

E = \sum_{i=1}^{N} x_i^2,

can be derived from the features. Energy sufficiency is
important in the context of floating reference hypotheses
[2]. In order that the classifier result be scale invariant,
we need energy sufficiency. With energy sufficiency, the
term

\frac{p(x|H_0)}{p(z_m|H_0)}

will be independent of the variance used in the H_0
reference hypothesis. Note that E = e_1' y / N, where e_1 =
[1, 2, 2, 2, ..., 2, 1]', which is composed of the number of
degrees of freedom in each frequency bin. Thus, energy
sufficiency means that the column space of A_m needs to
contain the vector e_1.
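Both constraints rest on simple identities that can be verified numerically. The sketch below, with helper names chosen here for illustration, checks that the weighted inner product on N/2 + 1 bins equals the ordinary inner product of the vectors extended symmetrically to N bins, and that the time-domain energy equals e_1'y/N (Parseval's relation for a real signal with even N).

```python
import numpy as np

def dof_weights(n_fft):
    """e_1 = [1, 2, 2, ..., 2, 1]: degrees of freedom per frequency bin."""
    e1 = np.full(n_fft // 2 + 1, 2.0)
    e1[0] = e1[-1] = 1.0
    return e1

def weighted_inner(u, v, eps):
    """Inner product <u, v> = sum_i eps_i u_i v_i over bins 0..N/2."""
    return float(np.sum(eps * u * v))

def extend_to_full(u):
    """Mirror bins 1..N/2-1 to obtain the full N-bin vector."""
    return np.concatenate([u, u[-2:0:-1]])

N = 96
rng = np.random.default_rng(1)
e1 = dof_weights(N)

# Orthonormality: weighted half-spectrum product equals the full-spectrum product.
u = rng.standard_normal(N // 2 + 1)
v = rng.standard_normal(N // 2 + 1)
half_product = weighted_inner(u, v, e1)
full_product = float(extend_to_full(u) @ extend_to_full(v))

# Energy sufficiency: total energy of x is recoverable from y as e_1' y / N.
x = rng.standard_normal(N)
y = np.abs(np.fft.rfft(x)) ** 2
energy_time = float(np.sum(x ** 2))
energy_freq = float(e1 @ y) / N
```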

2.2.1  Class-specific  iterated  subspace  (CSIS)

Since we would like the feature set created by projecting
onto the columns of A to characterize the statistical
variations within the class, a natural first step is to use principal
component analysis (PCA). To do this, we arrange the
spectral vectors from the training set into a matrix

X = [y^1 \; y^2 \; \cdots \; y^K],

where K is the number of training vectors. To meet the
energy sufficiency constraint, we fix the first column of A to be
the normalized e_1,

\tilde{e}_1 = e_1 / \|e_1\|.

To find the best linear subspace orthogonal to \tilde{e}_1, we first
orthogonalize the columns of X to \tilde{e}_1:
X_n = X - \tilde{e}_1 (\tilde{e}_1' X).
Let U be the P singular vectors of X_n with the largest singular
values, or equivalently the P eigenvectors of X_n X_n' with the
largest eigenvalues. We then set A = [\tilde{e}_1 | U].
We then proceed to maximize (9) using an iterative approach.
We use the term class-specific iterated subspace (CSIS) to
refer to the columns of A_m obtained in this way.
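The PCA starting point described above can be sketched as follows. This covers only the initialization, not the iterative maximization of (9), and it uses the ordinary Euclidean norm and projection exactly as the formulas in this subsection are written; handling the weighted inner product of section 2.2 would need an additional rescaling that is not shown. The function name and the synthetic training matrix are illustrative assumptions.

```python
import numpy as np

def csis_initial_subspace(X, P):
    """Return A = [e | U] from spectral vectors stored as the columns of X."""
    n_bins = X.shape[0]
    e1 = np.full(n_bins, 2.0)
    e1[0] = e1[-1] = 1.0
    e = e1 / np.linalg.norm(e1)            # normalized e_1 (Euclidean norm assumed)
    Xn = X - np.outer(e, e @ X)            # remove the component along e
    U, s, _ = np.linalg.svd(Xn, full_matrices=False)
    return np.column_stack([e, U[:, :P]])  # singular values are sorted descending

# Synthetic stand-in for K = 200 training spectra with N = 96 (49 bins).
rng = np.random.default_rng(2)
X = rng.chisquare(2, size=(49, 200))
A0 = csis_initial_subspace(X, P=5)
```

Since every column of X_n is orthogonal to the normalized e_1, the leading singular vectors are as well, so the resulting matrix has orthonormal columns and contains e_1 in its column space.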

3.  EXPERIMENTAL  APPROACH 
3.1  Data  Set 

We used the TIMIT [6] data set as a source of phonemes,
drawing all of our data from the “training” portion. TIMIT
consists of sampled time-series (in 16 kHz .wav files) of
scripted sentences read by a wide variety of speakers and
includes index tables that point to the start and stop samples of
each spoken phoneme in the text. There are 61 phonemes in
the database, each having a 1 to 4 character code. We use the
term dataclass to represent the collection of all the phonemes of
a given type from a given speaker. The average number of
samples (utterances) of a given speaker/phoneme combination
is about 10 and ranges from 1 up to about 30 for some
of the most common phonemes. We used speaker/phoneme
combinations with no fewer than 10 samples.

3.2  Cross-Validation 

In all of our classification experiments, the utterances of
a given speaker/phoneme were divided into two sets, even
(samples 2, 4, 6, ...) and odd (samples 1, 3, 5, ...). We conducted
two sub-experiments, training on even and testing on odd, then
training on odd and testing on even. We reported the sum of the
classification counts from the two experiments.
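The even/odd protocol can be sketched in a few lines. The function names and the count_errors callback are hypothetical; the callback stands for whatever train-then-test routine is being evaluated.

```python
def even_odd_folds(utterances):
    """Return the two (train, test) splits of even/odd cross-validation."""
    odd = utterances[0::2]    # samples 1, 3, 5, ... in 1-based numbering
    even = utterances[1::2]   # samples 2, 4, 6, ...
    return [(even, odd), (odd, even)]

def total_errors(utterances, count_errors):
    """Sum the error counts of both sub-experiments.

    count_errors(train, test) is a hypothetical callback returning the
    number of misclassified test utterances.
    """
    return sum(count_errors(train, test)
               for train, test in even_odd_folds(utterances))

folds = even_odd_folds(list(range(1, 8)))
```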

3.3  Processing 

We now describe the processing for the features of the MEL
frequency cepstral coefficient (MFCC) classifier and CSIS.
In order to concentrate on the basic dimension reduction step
(equation 2), the simplest possible processing and PDF
modeling was used. Each step in the processing is described
below, in the order in which it is applied.

3.3.1  Resampling 

We pre-processed all TIMIT .wav files by resampling from
16 kHz down to 12 kHz. Phoneme endpoints were
correspondingly converted and used to select data from the 12 kHz
time-series.

3.3.2  Truncation 

The phoneme data was truncated to a multiple of 384 samples
by discarding samples at the end. Those phoneme events that
were shorter than 384 samples at 12 kHz were dropped. Doing
this allowed us to use FFT sizes of 48, 64, 96, 128, or 192
samples, which are all factors of 384.

3.3.3  FFT  processing 

We computed non-overlapped unshaded (rectangular window
function) FFTs, resulting in a sequence of magnitude-squared
FFT spectral vectors of length N/2 + 1, where N is
the FFT size. The number of FFTs in the sequence depended
on how many non-overlapped FFTs fit within the truncated
phoneme utterance.
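The endpoint conversion, truncation, and FFT steps can be sketched as below. The resampling filter itself is omitted, and the rounding rule in convert_endpoint is an assumption, since the text does not say how converted endpoints were rounded. All function names are illustrative.

```python
import numpy as np

BLOCK = 384  # truncation unit in samples at 12 kHz

def convert_endpoint(sample_16k):
    """Map a 16 kHz sample index to 12 kHz (floor rounding is an assumption)."""
    return (sample_16k * 3) // 4

def truncate_segment(x):
    """Keep the largest multiple of BLOCK samples; drop segments that are too short."""
    n = (len(x) // BLOCK) * BLOCK
    return x[:n] if n > 0 else None

def spectral_vectors(x, n_fft):
    """Non-overlapped, rectangular-window magnitude-squared FFTs as columns."""
    if BLOCK % n_fft != 0 or len(x) % n_fft != 0:
        raise ValueError("n_fft must divide both BLOCK and the segment length")
    frames = x.reshape(-1, n_fft)
    return (np.abs(np.fft.rfft(frames, axis=1)) ** 2).T   # (n_fft/2 + 1, n_frames)

segment = truncate_segment(np.arange(1000, dtype=float))
Y = spectral_vectors(segment, 96)
```

A 1000-sample segment is cut to 768 samples, which yields eight frames of 49 bins each at N = 96.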

3.3.4  Spectral  normalization 

Spectral vectors were normalized after FFT processing. For
non-speaker-dependent (MEL cepstrum) features, the spectral
vectors were normalized by the average spectrum of all
available data.

For CSIS (speaker-dependent) features, the spectral values
for each speaker/phoneme combination were normalized
by the average spectrum for that speaker/phoneme. In
classification experiments the average spectrum was computed
from the training data to avoid issues of data separation.


3.3.5  Subspace  Projection  (Matrix  Multiplication) 

Next, the spectral vectors, denoted by y, were projected onto
a lower dimensional subspace by a matrix as in (2), resulting
in feature vectors, denoted by w.

For MFCC, the columns of A were MEL frequency band
functions. The number of columns in matrix A was Nc + 2,
including the zero and Nyquist half-bands.

For CSIS, A was an orthonormal matrix determined
from the optimization algorithm. For CSIS, the number of
columns of A was P + 1, where P is the number of basis
functions in addition to the first column (the normalized e_1).

3.3.6  Feature  Conditioning 

From a statistical point of view, feature conditioning has
no effect on the information content of the features. It does,
however, make probability density function (PDF) estimation
easier if the resulting features are approximately independent
and Gaussian. For MFCC, the features were conditioned by
taking the log and DCT as in (1). For CSIS, features were
conditioned first by dividing features 2 through P + 1 by the
first feature. This effectively normalizes the features since
the first feature, being a projection onto e_1, is a power
estimate for the segment. Lastly, the log of the first feature is
taken. Mathematically, we have for CSIS

w = A'y,

z_1 = \log(w_1),

z_i = w_i / w_1,  i = 2, 3, \ldots, P + 1.
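The CSIS conditioning step is small enough to write out directly. The inverse function is included only to illustrate that the step loses no information; both function names are chosen here for illustration.

```python
import numpy as np

def csis_condition(w):
    """z_1 = log(w_1); z_i = w_i / w_1 for i = 2..P+1 (requires w_1 > 0)."""
    w = np.asarray(w, dtype=float)
    z = np.empty_like(w)
    z[0] = np.log(w[0])
    z[1:] = w[1:] / w[0]
    return z

def csis_uncondition(z):
    """Inverse mapping, showing that the conditioning is 1:1."""
    w1 = np.exp(z[0])
    return np.concatenate([[w1], z[1:] * w1])

w = np.array([4.0, 2.0, -1.0, 0.5])
z = csis_condition(w)
z_scaled = csis_condition(10.0 * w)   # same segment, 10 times the power
```

Scaling the input power changes only the first feature, by the log of the scale factor; the remaining features are unchanged.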

3.3.7  J-function  calculation

J-function contributions must be included for FFT
magnitude-squared, spectral normalization, matrix
multiplication, and feature conditioning. See [7] for details of
these class-specific modules.

3.3.8  PDF  modeling  and  Classification 

We used a simple multivariate Gaussian PDF model, or
equivalently a Gaussian mixture model (GMM) with a single
mixture component. We assume independence between
the members of the sequence within a given utterance, thus
disregarding the time ordering. The log-likelihood value of
a sample was obtained by evaluating the total log-likelihood
of the feature sequence from the phoneme utterance. The
reason we used such simplified processing and PDF models
was to concentrate our discussion on the features themselves.
Classification was accomplished by maximization of
log-likelihood across class models. For CSS and CSIS, we
added the log J-function value to the log-likelihood value of
the GMM [2], implementing (6) in the log domain.
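A minimal sketch of the single-Gaussian model and the log-domain form of (6) follows. The small ridge added to the covariance is an assumption for numerical safety, not something the paper states, and the log J-function values are taken as given inputs because their computation is deferred to [7]. In the toy usage both classes share one feature set, so the log J terms are set to zero.

```python
import numpy as np

def fit_gaussian(Z, ridge=1e-6):
    """Fit one multivariate Gaussian to features Z of shape (n_frames, dim)."""
    mu = Z.mean(axis=0)
    cov = np.atleast_2d(np.cov(Z, rowvar=False)) + ridge * np.eye(Z.shape[1])
    return mu, cov

def sequence_loglik(Z, mu, cov):
    """Total log-likelihood of a feature sequence, frames treated as independent."""
    d = Z - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = np.sum(d * np.linalg.solve(cov, d.T).T, axis=1)
    dim = Z.shape[1]
    return float(np.sum(-0.5 * (quad + logdet + dim * np.log(2.0 * np.pi))))

def classify(Z_by_class, log_j_by_class, models):
    """argmax over m of log J_m + log p(z_m|H_m); each class has its own features."""
    scores = [log_j + sequence_loglik(Z, mu, cov)
              for Z, log_j, (mu, cov) in zip(Z_by_class, log_j_by_class, models)]
    return int(np.argmax(scores)), scores

# Toy usage with synthetic features (not data from the paper).
rng = np.random.default_rng(0)
model0 = fit_gaussian(rng.standard_normal((200, 3)))
model1 = fit_gaussian(rng.standard_normal((200, 3)) + 5.0)
Z_test = rng.standard_normal((8, 3)) + 5.0
best, scores = classify([Z_test, Z_test], [0.0, 0.0], [model0, model1])
```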

4.  EXPERIMENTAL  RESULTS 
4.1  Data  Description 

We selected fourteen phonemes for our experiments. For
each phoneme, we chose a set of four to seven individual
speakers of the same sex. We selected phoneme/speaker
combinations that had large numbers of utterances per
speaker, with a minimum of ten utterances per speaker. Thus,
each phoneme set consisted of about 60 utterances. Phoneme
sets were arranged into seven pairs for use in two-phoneme
individual speaker experiments.


4.2  Basis  Function  optimization 

4.2.1  Validation  of  Assumptions 

An important experiment to perform is to validate the
assumption used in section 2.2, that maximizing L (equation
7) can be achieved by maximizing Q in equation (9).
Although space does not permit presenting the results, we have
obtained overwhelming evidence that the second term in (8)
does in fact dominate.

4.2.2  Choice  of  FFT  size  and  model  order

The CSIS approach is parameterized by two parameters, the
FFT size N and the model order P. The MFCC method is
parameterized by the FFT size N and the number of MEL
bands Nc. We chose to use the same value of N for MFCC
and CSIS. This ensured that the only significant difference
between MFCC and CSIS would be the ability to choose
matrix A_m as a function of class thanks to the PPT. Feature
conditioning is also different but is not expected to contribute
greatly to performance differences. For fair comparison,
we selected the FFT size to maximize the performance
of MFCC, which turned out to be N = 96. For MFCC, we
always used the optimum Nc = 10.

For CSIS, we are left with deciding on the model order P.
Figure 1 shows the total log-likelihood L of speaker MGRL0,
phoneme “N”, as a function of P. Even-odd cross-validation is
used (section 3.2). Note that the likelihood increases up to
P = 5 then exhibits a steep decline. This suggests that a
dimension-5 subspace is optimal to represent this
speaker/phoneme combination. For the individual speaker
experiments, we chose the model order for each
speaker/phoneme combination in the same way.

Figure 1: Total log-likelihood (with even-odd cross-validation)
as a function of P for speaker MGRL0, phoneme “N”, with CSIS.

To address the phoneme-class experiments we need
to expand the data to include all speakers of a given phoneme.
We expanded the data to all male speakers of “N” and
attained a peak at P = 8. This indicates that an increase in
subspace dimension is required. In phoneme-class experiments,
we used a constant value of P = 8 for all phoneme classes.
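The model-order rule described in this subsection amounts to picking the P with the highest cross-validated total log-likelihood. The sketch below assumes a hypothetical callback cv_loglik(P) that trains and scores CSIS at order P; the quadratic toy curve is purely synthetic and is not data from the paper.

```python
def select_model_order(cv_loglik, candidates):
    """Return the candidate P with the largest cross-validated log-likelihood."""
    scores = {P: cv_loglik(P) for P in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic stand-in curve that peaks at P = 5 (illustration only).
toy_curve = lambda P: -float((P - 5) ** 2)
best_P, scores_by_P = select_model_order(toy_curve, range(1, 11))
```

Because the criterion depends only on the class's own data, the order can be chosen for each class in isolation.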

4.3  Classification  Experiments 

We conducted seven individual speaker experiments, each
involving two phonemes (see section 4.1).

4.3.1  Performance  metrics 

Because in each experiment we used a number of individual
speakers of each phoneme, it is possible to measure
both inter-speaker errors (speaker identity errors) and
inter-phoneme errors. We define the following performance
metrics:

1.  Ec is the confusion matrix error metric, which is the total
count in the off-diagonal elements of the confusion matrix.
Thus, it is a measure of speaker identity errors without
regard to the phoneme.

2.  Inter-phoneme error Eip counts the number of
inter-phoneme errors.
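The two metrics can be computed from true and predicted class labels as sketched below, assuming each class label is a (speaker, phoneme) pair. The function name and the four-utterance toy example are illustrative, not results from the experiments.

```python
import numpy as np

def error_metrics(true_labels, pred_labels):
    """Return (Ec, Eip, confusion matrix) for (speaker, phoneme) class labels."""
    classes = sorted(set(true_labels) | set(pred_labels))
    index = {c: i for i, c in enumerate(classes)}
    conf = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        conf[index[t], index[p]] += 1
    e_c = int(conf.sum() - np.trace(conf))   # all off-diagonal counts
    e_ip = sum(t[1] != p[1] for t, p in zip(true_labels, pred_labels))
    return e_c, e_ip, conf

truth = [("s1", "n"), ("s1", "m"), ("s2", "n"), ("s2", "m")]
preds = [("s2", "n"), ("s1", "m"), ("s2", "m"), ("s2", "m")]
e_c, e_ip, conf = error_metrics(truth, preds)
```

In the toy example, the first utterance is assigned to the wrong speaker but the right phoneme, so it counts toward Ec only; the third is assigned to the wrong phoneme and counts toward both.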

All of our experiments used strict separation between
training and testing data (section 3.2). In all cases, data was
separated into even and odd events (utterances). First all models
were trained on odd events (1, 3, 5, etc.) and tested on
even events (2, 4, 6, etc.), then all models were trained on even
events and tested on odd events. The errors were added to
obtain the aggregate error count.

4.3.2  Two-Phoneme  Experiments. 

The two-phoneme experiments were designed to test the
ability to distinguish speakers of a given phoneme as well as
classify two phonemes in a limited multi-speaker
environment. In each of the seven two-phoneme experiments, we
tested both CSIS and MFCC under two conditions. In
single-speaker (SS) classifier training, we separately trained a
model on each speaker/phoneme combination. In
phoneme-class (PC) classifier training, we grouped all speakers of a
given phoneme into a single phoneme class. For the SS
classifiers, we measured Ec, which included all errors, and Eip,
which counts only inter-phoneme errors. For the PC
classifiers, we could only measure Eip.

To provide the most meaningful performance comparison,
we optimized the performance of MFCC by finding the best
combination of parameters N and Nc over all seven
experiments. Metric Eip was at a minimum at Nc = 10, N = 96. For
Ec, it was close to the minimum at the same parameter
setting. Thus, we chose Nc = 10, N = 96 as the benchmark for
comparison.

The seven experiments tested phonemes “IY” versus
“EH”, “AE” versus “EH”, “R” versus “L”, “AX” versus
“AXR”, “IX” versus “IH”, “N” versus “M”, and
“DCL” versus “TCL”. Between four and seven speakers per
phoneme were used, with an average of about 12 utterances
per speaker/phoneme combination. The results are plotted in
figure 2. First, CSIS-SS, with P chosen separately for each
class, performed generally better than CSIS-SS(5), which uses
model order fixed at P = 5. This indicates that individually
optimized model order is better. The fact that the model
orders were determined individually without regard to other
classes is an important observation. In comparison
to MFCC-SS, CSS-SS achieved a lower Ec in all
experiments. As a means of comparison, MFCC produced higher
values of Ec by 14, 22, 38, 22, 7.5, 59, 12 and 12 percent,
an average of 25.5 percent higher. Using the Eip error
metric, for which we have no space to report detailed results,
there was not much difference between CSS-SS and MFCC-SS.
For multi-speaker training, MFCC-PC was consistently
better than CSS-PC.

Figure 2: Comparison of MFCC and CSIS using individual
speaker error metric Ec. The experiment number is on the X
axis in the following order: “IY” versus “EH”, “AE” versus
“EH”, “R” versus “L”, “AX” versus “AXR”, “IX” versus
“IH”, “N” versus “M”, “DCL” versus “TCL”. CSIS-SS(5)
indicates CSIS with model order fixed at P = 5.

4.3.3  Single-Speaker  Experiments.

The single-speaker experiments were designed to test the
ability to distinguish phonemes of a given speaker. In each of
the seventeen single-speaker experiments, we gathered data
from a single speaker and between four and seven phonemes
into one classification experiment and measured Ec. The
results are summarized in figure 3. CSIS does generally better
than MFCC except in two experiments where it does worse
and one where it is the same. The total number of errors
across the seventeen experiments was 435 for MFCC. We
first tried CSIS with model order fixed at P = 5 (indicated
as CSIS(5) in the figure) and achieved total errors of 385.
We then selected P individually by maximizing the total
log-likelihood (section 4.2.2) and achieved 361 errors, a
reduction of 6.5 percent and an improvement of 20 percent over
MFCC. This is significant because in addition to matrix A
being a function of class, the feature dimension is also a
function of class.

4.4  Discussion  of  Results 

We can draw some meaningful conclusions from the
experiments. First, we see that both in discriminating phonemes
of a given speaker and in discriminating speakers of a
given phoneme, CSIS is clearly better than MFCC. On the
other hand, MFCC is generally better in speaker-independent
phoneme discrimination. The reason may lie in the shrinking
of the linear subspace as we restrict ourselves to a single
speaker/single phoneme. When the subspace is limited, CSIS
may be able to find a better statistical model of the
distribution. A second piece of evidence that supports this is the
fact that the highest improvement of CSIS-SS over MFCC-SS
was obtained in the experiment “N-vs-M”, which is one of
the most difficult problems in ASR, an indication that CSIS
produces a better PDF estimate at the center of the
distributions. Thus, when classes are closer to each other,
i.e. overlapped, the better PDF estimate will be more
important, because the optimal decision boundary is given by the
true likelihood ratio. However, since MFCC has evolved for
phoneme discrimination, it performs better than CSIS in the
inter-phoneme areas. When two phonemes are very similar,
discrimination occurs “near the peak” where CSIS performs
better.

Future work should determine how the strengths of
both CSIS and MFCC can best be utilized. The evidence we
provided suggests that the most promising approach for
applying CSIS to multi-speaker experiments may lie in the ability
to cluster speakers into like-sounding groups, which can be
represented by separate low-dimensional CSIS models.

Figure 3: Comparison of MFCC and CSIS in single-speaker
experiments using error metric Ec. The experiment number
is on the X axis in the following order: fmem0 fceg0 mkag0
fapb0 mcxm0 mmea0 fdaw0 mgrl0 mkdd0 msat1 mbma1
mprk0 fklh0 mjma0 mbth0 mbcg0 mmlm0, which is in
order of increasing MFCC error. CSIS(5) indicates CSIS with
model order fixed at P = 5.